
Web Archive Datasets




8
RESULTS


Geocities Datasets
Collection: 1 item, 777 views

Early Web Datasets
Collection: 2 items, 722 views

This collection contains various datasets generated from the "early web" era (1996-1999) of the Internet Archive's global web archive collection.
Parallel Language Records of the Early Web (1996-1999): This dataset consists of multi-language URLs of the early web (1996-1999), grouped by common URL patterns. “Parallel language” refers to the same text represented in different languages; multi-language text from websites is a rich source for parallel language corpora and can be...
Friendster Datasets
Collection: 2 items, 843 views

This collection contains datasets generated from the Friendster web archive collection in the Internet Archive. Founded in 2002, Friendster was one of the more popular early social networking sites. You can read more about Friendster and the archiving effort at The Archive Team Friendster Snapshot Collection.
Friendster WAT Dataset: WAT stands for Web Archive Transformation, a file composed of key metadata from an archived web resource, such as provenance and archival...
Early Web Datasets
Data: 161 views, 0 favorites, 0 comments

This dataset consists of metadata records identifying the language of webpages from the early web (1996-1999) portion of the Internet Archive’s web archive collection. The metadata includes specific capture information related to a URL along with the top 3 language annotations for each page, as detected by Google’s Compact Language Detector (CLD3), a neural network model for language identification. Confidence scores are computed according to Internet Archive’s own Square-Leaf-Model,...
Geocities Datasets
by Nick Ruest
Data: 313 views, 10 favorites, 0 comments

Web archive derivatives of the GeoCities collection from the Internet Archive (v3). The derivatives were created with the Archives Unleashed Toolkit 1.2.0, and align with the derivatives produced by Archives Research Compute Hub (ARCH). This updated dataset includes last_modified_date columns for a number of the derivatives, as well as a YYYYMMDD crawl_date format for the domain graph. The CSV derivatives include:
Domain frequency: domain, count
Domain graph: crawl_date, source, target, count
Image graph...
Topics: csv, apache spark, geocities, web archives, archives unleashed, sparkling
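
As a rough illustration of how the domain graph derivative might be used, the sketch below sums the count column per target domain. The column names (crawl_date, source, target, count) come from the description above; the presence of a header row, the sample rows, and the function name total_links_by_target are assumptions made for this example only.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. Only the column names come from the dataset
# description; the values and the header row are assumptions.
SAMPLE_DOMAIN_GRAPH = """crawl_date,source,target,count
20091027,geocities.com,yahoo.com,12
20091027,geocities.com,angelfire.com,3
20091028,geocities.com,yahoo.com,5
"""


def total_links_by_target(csv_text):
    """Sum the count column per target domain across all crawl dates."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["target"]] += int(row["count"])
    return totals


totals = total_links_by_target(SAMPLE_DOMAIN_GRAPH)
for domain, count in totals.most_common():
    print(domain, count)
```

For a real file, the same function could be fed the text of the downloaded CSV instead of the inline sample.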
Early Web Datasets
Data: 288 views, 2 favorites, 0 comments

This dataset consists of multi-language URLs of the early web (1996-1999), grouped by common URL patterns. “Parallel language” refers to the same text represented in different languages; multi-language text from websites is a rich source for parallel language corpora and can be valuable in work such as machine translation. The URLs and patterns were derived and extracted from the Internet Archive’s Wayback Machine web archive, for all archived webpages successfully captured before year...
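
To make "grouped by common URL patterns" more concrete, here is a minimal sketch of one way such grouping could work: URLs that differ only in a language-code path segment are collapsed onto a shared pattern. This is not the method used to build the dataset, which the description does not specify; the LANG_CODES set, the {lang} placeholder, and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny set of language codes for illustration.
LANG_CODES = {"en", "fr", "de", "es"}


def url_pattern(url):
    """Replace any '/'-separated segment that is a known language code."""
    segments = url.split("/")
    return "/".join(
        "{lang}" if seg.lower() in LANG_CODES else seg for seg in segments
    )


def group_parallel(urls):
    """Group URLs by pattern, keeping only patterns with 2+ language variants."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return {pattern: members for pattern, members in groups.items()
            if len(members) > 1}


urls = [
    "http://example.com/en/index.html",
    "http://example.com/fr/index.html",
    "http://example.com/about.html",
]
groups = group_parallel(urls)
print(groups)
```

A real pipeline would also need to handle language markers in hostnames, query strings, and file suffixes, which this sketch ignores.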
Friendster Datasets
Data: 199 views, 0 favorites, 0 comments

Longitudinal Graph Analysis (LGA) files are archival web graph files that contain a complete list of which URIs link to which URIs, including URIs that are linked to but not included in the source dataset, along with a timestamp for each source snapshot in the archival collection. They are ~1% the size of a collection’s aggregate WARC files, and are delivered as a GZip container of two file types: ID-Map and ID-Graph. Additional details on format, statistics, samples, et cetera can be found in the ...
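
The two-file design can be sketched as a join: the ID-Map assigns an integer ID to each URI, and the ID-Graph lists, per source ID and timestamp, the IDs it links to. The sketch below resolves those IDs back into URI pairs. The JSON-lines layout and the field names (id, url, timestamp, outlink_ids) are assumptions for illustration and should be checked against the format documentation the description points to.

```python
import json

# Hypothetical sample lines; the field names are assumptions, not confirmed
# by the description above.
ID_MAP_LINES = [
    '{"id": 1, "url": "http://example.com/"}',
    '{"id": 2, "url": "http://example.org/a"}',
    '{"id": 3, "url": "http://example.net/b"}',
]
ID_GRAPH_LINES = [
    '{"id": 1, "timestamp": "20090101000000", "outlink_ids": [2, 3]}',
]


def resolve_edges(id_map_lines, id_graph_lines):
    """Join ID-Graph records against the ID-Map to get (timestamp, source, target)."""
    id_to_url = {}
    for line in id_map_lines:
        record = json.loads(line)
        id_to_url[record["id"]] = record["url"]

    edges = []
    for line in id_graph_lines:
        record = json.loads(line)
        source = id_to_url[record["id"]]
        for target_id in record["outlink_ids"]:
            edges.append((record["timestamp"], source, id_to_url[target_id]))
    return edges


edges = resolve_edges(ID_MAP_LINES, ID_GRAPH_LINES)
for edge in edges:
    print(edge)
```

Storing integer IDs instead of repeating full URIs on every edge is one plausible reason the files are so much smaller than the WARC files they summarize.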
Friendster Datasets
Data: 192 views, 0 favorites, 0 comments

WAT stands for Web Archive Transformation, a file composed of key metadata from an archived web resource, such as provenance and archival capture information, key text such as meta tags and anchor text, link data, and other essential metadata and information. WAT records are extracted from WARC records (WARC being the ISO standard for web archive files), with each WAT file mapping one-to-one to the corresponding WARC file. WAT formats metadata into JavaScript Object Notation (JSON). The...
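
Because WAT metadata is JSON, a record payload can be inspected with a standard JSON parser. The sketch below pulls the target URI, page title, and anchor links out of one payload. The nested key names (Envelope, WARC-Header-Metadata, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata, Links) reflect a commonly seen WAT layout but are an assumption here, as are the sample record and the function name summarize_wat_record.

```python
import json

# Hypothetical WAT-style payload; the key layout is an assumption and the
# values are invented for illustration.
SAMPLE_WAT_JSON = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {
            "WARC-Target-URI": "http://example.com/",
            "WARC-Date": "2009-01-01T00:00:00Z",
        },
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "HTML-Metadata": {
                    "Head": {"Title": "Example"},
                    "Links": [
                        {"path": "A@/href", "url": "http://example.org/a",
                         "text": "A link"},
                        {"path": "IMG@/src", "url": "/logo.gif"},
                    ],
                }
            }
        },
    }
})


def summarize_wat_record(raw_json):
    """Extract target URI, title, and (url, anchor text) pairs from one payload."""
    envelope = json.loads(raw_json).get("Envelope", {})
    header = envelope.get("WARC-Header-Metadata", {})
    html = (envelope.get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {}))
    anchors = [
        (link.get("url"), link.get("text", ""))
        for link in html.get("Links", [])
        if link.get("path") == "A@/href"  # keep only <a href> links
    ]
    return {
        "uri": header.get("WARC-Target-URI"),
        "title": html.get("Head", {}).get("Title"),
        "anchors": anchors,
    }


summary = summarize_wat_record(SAMPLE_WAT_JSON)
print(summary)
```

Reading actual WAT files additionally requires a WARC-format reader to iterate over records, which is left out of this sketch.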